The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Authors

  • Siyuan Ma
  • Raef Bassily
  • Mikhail Belkin
Abstract

Stochastic Gradient Descent (SGD) with a small mini-batch is a key component of modern large-scale machine learning. However, its efficiency has not been easy to analyze, as most theoretical results require adaptive rates and show convergence rates far slower than those of gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, these regimes allow for very fast convergence of SGD, comparable in the number of iterations to gradient descent. Specifically, consider the setting with a quadratic objective function, or near a minimum, where the quadratic term is dominant. We show that:
  • Mini-batch size 1 with constant step size is optimal in terms of the computations needed to achieve a given error.
  • There is a critical mini-batch size such that:
    – (linear scaling) an SGD iteration with mini-batch size m smaller than the critical size is nearly equivalent to m iterations of mini-batch size 1;
    – (saturation) an SGD iteration with a mini-batch larger than the critical size is nearly equivalent to a gradient descent step.
The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying O(n) acceleration over GD per unit of computation. We give experimental evidence on real data, with results closely following our theoretical analyses. Finally, we show how the interpolation perspective and our results fit with recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
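As a rough illustration of the linear-scaling and saturation regimes described in the abstract, the sketch below runs mini-batch SGD on a synthetic over-parametrized least-squares problem with noiseless targets (so the empirical loss can be driven to zero) and counts the iterations needed to reach a fixed training error for several batch sizes. The data, the step-size rule, and the tolerance are illustrative assumptions, not the paper's experimental setup or its exact optimal step-size formula.

```python
# Sketch: mini-batch SGD under interpolation on a synthetic least-squares problem.
# Expectation (not guaranteed numbers): iterations shrink roughly like 1/m up to a
# data-dependent critical batch size, then improve little further (saturation).
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                               # d >> n, so interpolation is possible
scales = 1.0 / np.sqrt(np.arange(1, d + 1))   # anisotropic feature variances (assumption)
X = rng.standard_normal((n, d)) * scales
w_star = rng.standard_normal(d)
y = X @ w_star                                # noiseless targets: zero training loss attainable

H = X.T @ X / n                               # empirical second-moment matrix (loss Hessian)
lam_max = np.linalg.eigvalsh(H)[-1]           # largest eigenvalue
beta = np.max(np.sum(X ** 2, axis=1))         # largest squared sample norm

def step_size(m):
    # Heuristic constant step size: grows ~linearly with m, then saturates near 1/lam_max.
    # Only in the spirit of the paper's analysis; the paper derives an exact optimal rule.
    return m / (beta + (m - 1) * lam_max)

def iters_to_tol(m, tol=1e-3, max_iters=100_000):
    """Run mini-batch SGD with batch size m; return iterations until the loss drops below tol."""
    w = np.zeros(d)
    eta = step_size(m)
    for t in range(1, max_iters + 1):
        idx = rng.integers(0, n, size=m)      # sample a mini-batch with replacement
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / m
        w -= eta * grad
        if 0.5 * np.mean((X @ w - y) ** 2) < tol:
            return t
    return max_iters

for m in (1, 2, 4, 8, 16, 32, 64, n):
    print(f"batch size {m:3d}: {iters_to_tol(m):6d} SGD iterations to reach the tolerance")
```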

Similar articles

A Walk with SGD

Exploring why stochastic gradient descent (SGD) based optimization methods train deep neural networks (DNNs) that generalize well has become an active area of research. Towards this end, we empirically study the dynamics of SGD when training over-parametrized DNNs. Specifically, we study the DNN loss surface along the trajectory of SGD by interpolating the loss surface between parameters from co...
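The excerpt above is truncated; as a generic illustration of the interpolation idea it mentions (evaluating the loss along the straight line between consecutive SGD iterates), the sketch below uses a toy least-squares model rather than a DNN. All names and problem sizes are assumptions, not that paper's setup.

```python
# Generic sketch: loss along the segment between two consecutive SGD iterates,
# on a toy least-squares model (not the DNN setting of the cited paper).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))
y = X @ rng.standard_normal(20)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def sgd_step(w, eta=0.05, m=5):
    idx = rng.integers(0, X.shape[0], size=m)
    return w - eta * X[idx].T @ (X[idx] @ w - y[idx]) / m

w_t = np.zeros(20)
w_next = sgd_step(w_t)

# Evaluate the loss along w(alpha) = (1 - alpha) * w_t + alpha * w_next.
for alpha in np.linspace(0.0, 1.0, 11):
    w_alpha = (1 - alpha) * w_t + alpha * w_next
    print(f"alpha = {alpha:.1f}  loss = {loss(w_alpha):.4f}")
```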

Full text

The Effect of Modified Collaborative Strategic Reading on EFL Learners' Reading Anxiety

The present study was an attempt to investigate the effectiveness of a reading instructional approach called MCSR (Modified Collaborative Strategic Reading) in reducing intermediate EFL learners' reading anxiety. Based on a pretest-posttest design, MCSR was implemented with 64 EFL learners at the intermediate level. They received EFL reading instruction according to MCSR over two and a half months. A ...

Full text

Gradient Descent Quantizes ReLU Network Features

Deep neural networks are often trained in the over-parametrized regime (i.e., with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem (Zhang et al. [2017]). Several studies have highlighted the fact that the training procedure, i.e., mini-batch Stochastic Gradient Descent (SGD), leads to solutions that have s...

Full text

Small group discussion for medical students to learning embryology

Background: One of the most important issues is finding the best method to teach the embryology course to medical students. Small group discussion (SGD) was used to foster working together, which is integral to learning, to developing intellectual skills, and to an interactive learning experience. Methods: 72 medical students were equally randomized to SGD (group I) and usual lecture-based teaching (LBT) (group II) in genera...

Full text

To understand deep learning we need to understand kernel learning

Generalization performance of classifiers in deep learning has recently become a subject of intense study. Deep models, which are typically heavily over-parametrized, tend to fit the training data exactly. Despite this overfitting, they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a uni...

Full text

Journal:
  • CoRR

Volume: abs/1712.06559  Issue: -

Pages: -

Publication date: 2017